OptiClust, an Improved Method for Assigning Amplicon-Based Sequence Data to Operational Taxonomic Units
Abstract
Assignment of 16S rRNA gene sequences to operational taxonomic units (OTUs) is a computational bottleneck in the process of analyzing microbial communities. Although this has been an active area of research, it has been difficult to overcome the time and memory demands while improving the quality of the OTU assignments. Here, we developed a new OTU assignment algorithm that iteratively reassigns sequences to new OTUs to optimize the Matthews correlation coefficient (MCC), a measure of the quality of OTU assignments. To assess the new algorithm, OptiClust, we compared it to 10 other algorithms using 16S rRNA gene sequences from two simulated and four natural communities. Using the OptiClust algorithm, the MCC values averaged 15.2 and 16.5% higher than the OTUs generated when we used the average neighbor and distance-based greedy clustering with VSEARCH, respectively. Furthermore, on average, OptiClust was 94.6 times faster than the average neighbor algorithm and just as fast as distance-based greedy clustering with VSEARCH. An empirical analysis of the efficiency of the algorithms showed that the time and memory required to perform the algorithm scaled quadratically with the number of unique sequences in the data set. The significant improvement in the quality of the OTU assignments over previously existing methods will significantly enhance downstream analysis by limiting the splitting of similar sequences into separate OTUs and merging of dissimilar sequences into the same OTU. The development of the OptiClust algorithm represents a significant advance that is likely to have numerous other applications.
IMPORTANCE
The analysis of microbial communities from diverse environments using 16S rRNA gene sequencing has expanded our knowledge of the biogeography of microorganisms. An important step in this analysis is the assignment of sequences into taxonomic groups based on their similarity to sequences in a database or based on their similarity to each other, irrespective of a database. In this study, we present a new algorithm for the latter approach. The algorithm, OptiClust, seeks to optimize a metric of assignment quality by shuffling sequences between taxonomic groups. We found that OptiClust produces more robust assignments and does so in a rapid and memory-efficient manner. This advance will allow for a more robust analysis of microbial communities and the factors that shape them.
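The idea of iteratively reassigning sequences to improve the MCC can be illustrated with a small sketch. This is not the authors' implementation. It assumes that sequence pairs are scored as follows: a true positive is a similar pair placed in the same OTU, a false positive is a dissimilar pair placed in the same OTU, a false negative is a similar pair split across OTUs, and a true negative is a dissimilar pair kept apart. The function names `mcc` and `optimize_otus`, the `close_pairs` input, and the rule of trying each neighbor's OTU or a fresh OTU are hypothetical choices made for illustration.

```python
from itertools import combinations
from math import sqrt


def mcc(assign, close_pairs):
    """MCC of an OTU assignment (assumed pairwise definition, see lead-in).

    assign: dict mapping sequence name -> OTU label.
    close_pairs: set of frozenset({a, b}) for pairs within the distance threshold.
    """
    tp = fp = fn = tn = 0
    for a, b in combinations(sorted(assign), 2):
        same_otu = assign[a] == assign[b]
        similar = frozenset((a, b)) in close_pairs
        if same_otu and similar:
            tp += 1
        elif same_otu:
            fp += 1
        elif similar:
            fn += 1
        else:
            tn += 1
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # Return 0.0 when the coefficient is undefined (zero denominator).
    return (tp * tn - fp * fn) / denom if denom else 0.0


def optimize_otus(seqs, close_pairs, max_iter=100):
    """Greedy sketch: move each sequence to the OTU that most raises the MCC."""
    assign = {s: i for i, s in enumerate(seqs)}  # start with singleton OTUs
    next_label = len(seqs)  # label reserved for a brand-new OTU
    for _ in range(max_iter):
        changed = False
        for s in seqs:
            best_label, best_score = assign[s], mcc(assign, close_pairs)
            # Candidate moves: OTUs of similar sequences, or a fresh OTU.
            candidates = {assign[t] for t in seqs
                          if t != s and frozenset((s, t)) in close_pairs}
            candidates.add(next_label)
            for label in sorted(candidates):
                trial = dict(assign)
                trial[s] = label
                score = mcc(trial, close_pairs)
                if score > best_score:  # strict improvement only
                    best_label, best_score = label, score
            if best_label != assign[s]:
                assign[s] = best_label
                if best_label == next_label:
                    next_label += 1
                changed = True
        if not changed:  # a full pass without moves: assignments are stable
            break
    return assign


if __name__ == "__main__":
    seqs = ["a", "b", "c", "d"]
    close = {frozenset(p) for p in [("a", "b"), ("a", "c"), ("b", "c")]}
    otus = optimize_otus(seqs, close)
    print(otus, mcc(otus, close))
```

The sketch recomputes the MCC from scratch for every trial move, which is far more expensive than necessary; it is meant only to show the optimization loop, not to reproduce the speed or memory behavior reported above.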
Related articles
Defining DNA-based operational taxonomic units for microbial-eukaryote ecology.
DNA sequence information has increasingly been used in ecological research on microbial eukaryotes. Sequence-based approaches have included studies of the total diversity of selected ecosystems, studies of the autecology of ecologically relevant species, and identification and enumeration of species of interest for human health. It is still uncommon, however, to delineate protistan species base...
Ironing out the wrinkles in the rare biosphere through improved OTU clustering
Deep sequencing of PCR amplicon libraries facilitates the detection of low-abundance populations in environmental DNA surveys of complex microbial communities. At the same time, deep sequencing can lead to overestimates of microbial diversity through the generation of low-frequency, error-prone reads. Even with sequencing error rates below 0.005 per nucleotide position, the common method of gen...
NUMERICAL TAXONOMIC STUDY OF THE IRANIAN SPECIES OF ALYSSUM L. BASED ON MORPHOLOGICAL CHARACTERS
The genus Alyssum L. belongs to the subtribe Alyssinae, tribe Alysseae and family Cruciferae (Brassicaceae). This genus is one of the largest genera of the family of Cruciferae in Iran, and seems to be the most problematic genus in which the boundary of certain species is not completely clear due to the polymorphism of morphological characters. The main objective of this research is to stud...
From reads to operational taxonomic units: an ensemble processing pipeline for MiSeq amplicon sequencing data
The development of high-throughput sequencing technologies has provided microbial ecologists with an efficient approach to assess bacterial diversity at an unseen depth, particularly with the recent advances in the Illumina MiSeq sequencing platform. However, analyzing such high-throughput data is posing important computational challenges, requiring specialized bioinformatics solutions at diffe...
A novel conceptual approach to read-filtering in high-throughput amplicon sequencing studies
Adequate read filtering is critical when processing high-throughput data in marker-gene-based studies. Sequencing errors can cause the mis-clustering of otherwise similar reads, artificially increasing the number of retrieved Operational Taxonomic Units (OTUs) and therefore leading to the overestimation of microbial diversity. Sequencing errors will also result in OTUs that are not accurate rec...